Open Source Data Science: An R Perspective

author: Joseph B. Rickert
date: September 18, 2019
autosize: true

Center for Strategic and Budgetary Assessments
Implications of Data Science as a Resource
Workshop 3

RStudio

Privately held company (Boston)


Mission

R Consortium

left: 45%

Nonprofit Membership Corporation
- Organized under the Linux Foundation
- Governed by a Board of Directors
- Technical committee (ISC) funds projects and oversees work
- More than $1M awarded so far


Mission

Open Source


Photo by Alex Holyoake

What is Open Source?

Source: Linux Foundation

For more open source information TODO

Some Open Source Statistics

Engaging with Open Source

Source: Linux Foundation Enterprise Open Source: A Practical Introduction

Open Source Strengths

Open Source Weaknesses

The R Project

left: 40%


Some History
- 1995: R released as open-source
- 1997: R Core Group formed
- 1997: CRAN starts with 12 pkgs
- 2000: R 1.0.0 released
- 2001: Bioconductor Project
- 2003: R Foundation formed
- 2004: First useR! conf
- 2009: NY Times article on R
- 2015: The R Consortium
- 2019: CRAN near 15K pkgs

S was conceived as an interface

left: 60%

John Chambers’ famous diagram from May 1976 indicates the intention to design a software interface to call an arbitrary FORTRAN subroutine, ABC, by wrapping it in some simplified calling syntax: XABC( ).


The main idea was to bring the best computational facilities to the people doing the analysis. As John phrased it: “combine serious computational challenges with convenience”

R is more than a language!

“It was always understood that R is meant to build on a base of computational tools. R relies on the ability of functions to communicate with, and exchange objects with other software.”

John Chambers, Extending R, CRC Press (2016)

Some Characteristics of the R Language

R is an interpreted scripting language
- Base R has a relatively small footprint
- The majority of growth and innovation comes from contributed packages: libraries of functions
- Features such as non-standard evaluation make it a good choice for “domain-specific languages”
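
To make the point about non-standard evaluation concrete, here is a minimal base R sketch. The helper `keep_rows()` and the toy `scores` data frame are hypothetical names invented for this illustration; they are not part of R or of any package mentioned in this talk. The helper captures its `condition` argument unevaluated and then evaluates it with the data frame's columns in scope, which is roughly the mechanism that lets packages offer their own compact filtering syntax.

```r
# Hypothetical helper (not from any package): filter rows using an
# expression written in terms of the data frame's column names.
keep_rows <- function(data, condition) {
  # substitute() captures the expression the caller typed, unevaluated
  expr <- substitute(condition)
  # Evaluate it with the columns of `data` visible; anything else
  # (such as `cutoff` below) is looked up in the caller's environment
  rows <- eval(expr, data, parent.frame())
  data[rows, , drop = FALSE]
}

# Toy data, made up for illustration only
scores <- data.frame(
  id = c("a", "b", "c"),
  score = c(55, 72, 90),
  stringsAsFactors = FALSE
)

cutoff <- 60
passing <- keep_rows(scores, score >= cutoff)
print(passing)
```

Note that `score` is never defined as a variable in the calling code; it is resolved inside `scores` only when the captured expression is evaluated.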


R is a functional, object-based language
- Everything that exists in R is an object
- Everything that happens in R is a function call
- Interfaces to other software are part of R
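
The first two claims can be checked directly at the console. The following is a small sketch in base R; the names `sum_infix`, `sum_prefix`, `second`, `apply_twice`, and `result` are arbitrary and chosen only for this illustration.

```r
# Operators are ordinary functions: infix notation is syntactic sugar
sum_infix  <- 1 + 2
sum_prefix <- `+`(1, 2)
print(identical(sum_infix, sum_prefix))

# Subsetting is a function call as well
x <- c(10, 20, 30)
second <- `[`(x, 2)
print(second)

# Functions are themselves objects: they can be inspected and passed
# as arguments like any other value
print(class(`+`))
apply_twice <- function(f, value) f(f(value))
result <- apply_twice(sqrt, 16)   # computes sqrt(sqrt(16))
print(result)
```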

A Brief Example

NOAA-Polar-Ice.nb.html

The R Ecosystem

Open Source Data Science

left: 50%

Image by Jingwen Zheng


Machine Learning Algorithms

https://cran.r-project.org/web/views/MachineLearning.html

Machine Learning with Caret Package

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models.

tidymodels: a new framework for predictive modeling

Validation

Package Level
- Test programs are available for inspection and use
- E.g., almost 10K lines of test code for the survival package

Industry/Company Critical Collections
- E.g., Pharmaceutical Industry R Validation Hub

Data Science Workflow


Interoperable Software: R / Python / etc.

https://www.tensorflow.org

TensorFlow: Production Grade ML

Key TensorFlow Concepts

See the Google Brain paper from Abadi et al. (2017) for the details.


R Interfaces to TensorFlow

left: 60%


Reproducibility

R Markdown

https://rmarkdown.rstudio.com


Reproducibly integrate code and text

Use notebooks to weave together narrative text and code from several computer languages (including R, Python, and SQL) to produce formatted output in several document types (HTML, PDF, etc.).

Production Data Sources

left: 40%

https://spark.rstudio.com


Connect to Spark Clusters
- Use as backend to dplyr
- Filter and aggregate Spark data sets
- Use Spark’s MLlib

Production Workflows / Pipelines


Model Management Workflow Example

https://solutions.rstudio.com/model-management/overview/

Production Deployment with Containers

Kubernetes and Docker Pipeline


Containers:
- Package up everything needed to run an App
- Reliably deploy Apps on multiple platforms
- Distribute Apps across clusters

Content Deployment

https://solutions.rstudio.com/deploy/overview/

Communicate and Collaborate

https://shiny.rstudio.com


Some Shiny Examples

For many more see:
- The Shiny Gallery
- The Shiny User Showcase
- showmeshiny

Thank You

@RStudioJoe

This presentation is available at: https://github.com/joseph-rickert/OSD-Sept-2019